Validating module network learning algorithms using simu- 
lated data 
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Abstract 

Background: In recent years, several authors have used probabilistic graphical models to learn expression modules 
and their regulatory programs from gene expression data. Despite the demonstrated success of such algorithms 
in uncovering biologically relevant regulatory relations, further developments in the area are hampered by a lack 
of tools to compare the performance of alternative module network learning strategies. Here, we demonstrate the 
use of the synthetic data generator SynTRelM for the purpose of testing and comparing module network learning 
algorithms. We introduce a software package for learning module networks, called LeMolMe, which incorporates 
a novel strategy for learning regulatory programs. Novelties include the use of a bottom-up Bayesian hierarchical 
clustering to construct the regulatory programs, and the use of a conditional entropy measure to assign regulators 
to the regulation program nodes. Using SynTReN data, we test the performance of LeMolMe in a completely 
controlled situation and assess the effect of the methodological changes we made with respect to an existing 
software package, namely Genomica. Additionally, we assess the effect of various parameters, such as the size of 
the data set and the amount of noise, on the inference performance. 

Results: Overall, application of Genomica and LeMolMe to simulated data sets gave comparable results. However, 
LeMoNe offers some advantages, one of them being that the learning process is considerably faster for larger 
data sets. Additionally, we show that the location of the regulators in the LeMoNe regulation programs and their 
conditional entropy may be used to prioritize regulators for functional validation, and that the combination of the 
bottom-up clustering strategy with the conditional entropy-based assignment of regulators improves the handling 
of missing or hidden regulators. 

Conclusions: We show that data simulators such as SynTReN are very well suited for the purpose of developing, 
testing and improving module network algorithms. We used SynTReN data to develop and test an alternative 
module network learning strategy, which is incorporated in the software package LeMoNe, and we provide evidence 
that this alternative strategy has several advantages with respect to existing methods. 
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Background 

For the past 45 years, research in molecular biology 
has been based predominantly on reductionist think- 
ing, trying to unravel the complex workings of living 
organisms by investigating genes or proteins one at a 
time. In recent years, molecular biologists have come 
to view the cell from a different, more global perspec- 
tive. With the advent of fully sequenced genomes 
and high-throughput functional genomics technolo- 
gies, it has become possible to monitor molecular 
properties such as gene expression levels or protein- 
DNA interactions across thousands of genes simulta- 
neously. As a consequence, it has become feasible to 
study genes, proteins and their interactions in the 
context of biological systems rather than in isola- 
tion. This novel paradigm has been named 'systems 
biology' [1]. 

One of the goals of the systems approach to 
molecular biology is to reverse engineer the regula- 
tory networks underlying cell function. Particularly 
transcriptional regulatory networks have received a 
lot of attention, mainly because of the availabil- 
ity of large amounts of relevant experimental data. 
Several studies use expression data, promoter motif 
data, chromatin immunoprecipitation (ChIP) data 
and/or prior functional information (e.g. GO classi- 
fications [2] or known regulatory network structures) 
in conjunction to elucidate transcriptional regula- 
tory networks [3-17]. Most of these methods try 
to unravel the control logic underlying specific ex- 
pression patterns. This type of analysis typically re- 
quires elaborate computational frameworks. In par- 
ticular probabilistic graphical models are considered 
a natural mathematical framework for inferring reg- 
ulatory networks [8]. Probabilistic graphical models, 
the best-known representatives being Bayesian net- 
works, represent the system under study in terms of 
conditional probability distributions describing the 
observations for each of the variables (genes) as a 
function of a limited number of parent variables (reg- 
ulators), thereby reconstructing the regulatory net- 
work underlying the observations. Friedman et al. 
pioneered the use of Bayesian networks to learn reg- 
ulatory networks from expression data [3,4]. In these 
early studies, each gene in the resulting Bayesian 
network is associated with its individual regulation 
program, i.e., its own set of parents and conditional 
probability distribution. A key limitation of this ap- 
proach is that a vast number of structural features 
and distribution parameters need to be learned given 
only a limited number of expression profiles. In other 



words, the problem of finding back the real network 
structure is typically heavily undcrdctermincd. An 
attractive way to remedy this issue is to take ad- 
vantage of the inherent modularity of biological net- 
works [18], specifically the fact that groups of genes 
acting in concert are often regulated by the same 
regulators. Segal et al. [6, 19] first exploited this 
idea by proposing module networks as a mathemat- 
ical model for regulatory networks. Module net- 
works are probabilistic graphical models in which 
groups of genes, called modules, share the same par- 
ents and conditional distributions. As the number 
of parameters to be estimated in a module network 
is much smaller than in a full Bayesian network, the 
currently available gene expression data sets can be 
large enough for the purpose of learning module net- 
works [6,11,12,19]. 

Despite the demonstrated success of module net- 
work learning algorithms in finding biologically rel- 
evant regulatory relations [6, 11, 12, 19], there is only 
limited information about the actual recall and pre- 
cision of such algorithms [12] and how these perfor- 
mance measures are influenced by the use of alter- 
native module network learning strategies. Having 
the means to answer the latter question is key to the 
further development and improvement of the module 
networks formalism. 

The purpose of the present study is twofold. 
First, we introduce a novel software package for 
learning module networks, called LeMoNe, which is 
based on the general methodology outlined in Segal 
et al. [6] but incorporates an alternative strategy for 
inferring regulation programs. 

Second, we demonstrate the use of SynTRcN 
[20], a data simulator that creates synthetic regula- 
tory networks and produces simulated gene expres- 
sion data, for the purpose of testing and comparing 
module network learning algorithms. We use Syn- 
TRcN data to assess the performance of LeMoNe 
and to compare the behavior of alternative module 
network learning strategies. Additionally, we assess 
the effect of various parameters, such as the size of 
the data set and the amount of noise, on the in- 
ference performance. For comparison, we also use 
LeMoNe to analyze real expression data for S. cere- 
visiae [21] and investigate to what extent the qual- 
ity of the module networks learned on real data can 
be automatically assessed using structured biological 
information such as GO information and ChlP-chip 
data [9]. 
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Methods 
Data sets 

We used SynTRcN [20] to generate simulated data 
sets for a gene network with 1000 genes of which 
105 act as regulators. The topology of the network 
is subsampled from an E. coli transcriptional net- 
work [29] by cluster addition, resulting in a network 
with 2361 edges. All parameters of SynTRcN were 
set to default values, except number of correlated 
inputs, which was set to 50%. SynTReN generated 
expression values ranging from (no expression) to 
1 (maximal expression) which we normalized to log 2 
ratio values by picking one of the experiments as the 
control. Except where indicated otherwise, the list 
of true regulators was given as the list of potential 
regulators for LeMoNe and Genomica. 

For the tests performed on real data, we used an 
expression compendium for S. cerevisiae containing 
expression data for 173 different experimental stress 
conditions [21]. The data were obtained in prenor- 
malized and preprocessed form. We used the mean 
log 2 values of the expression ratios (perturbation vs. 
control). 

To assess the quality of the regulatory programs 
learned from real data, we used data on genome- 
wide binding and phylogenetically conserved mo- 
tifs for 102 transcription factors from Harbison et 
al. [9]. For a given transcription factor, only genes 
that were bound with high confidence (significance 
level a — 0.005) and showed motif conservation in 
at least one other Saccharomyces species (besides S. 
cerevisiae) were considered true targets. 



Module networks 

Module networks are a special kind of Bayesian net- 
works and were introduced by Segal et al. [6,30]. 
To each gene i we associate a random variable Xi 
which can take continuous values and corresponds 
to the gene's expression level. The distribution of 
Xi depends on the expression level of a set of par- 
ent genes Paj chosen from a list of potential regu- 
lators. If the network formed by drawing directed 
edges from parent genes to children genes is acyclic, 
we can define a joint probability distribution for the 
expression levels of all genes as a product of condi- 
tional distributions, 

N 

p(x!,...,x N ) = Y[pi(x l | {xj: j GPaJ). (1) 

i=l 



This is the standard Bayesian network formalism. 

In a module network we assume that genes are 
partitioned into different sets called modules, such 
that genes in the same module share the same pa- 
rameters in the distribution function (1). Hence 
a module network is defined by a partition of 
{l,...,N} into K <C N modules Ak such that 
UfcLi A k = {1, . . . , N} and A k D Ay = for k ^ ft', 
a collection of parent genes for each module ft, 
and a joint probability distribution 

K 

p(xi, ...,x N )=y[ n i "■ j e 

fe=i ieA k 

(2) 

The conditional distribution pk of the expression 
level of the genes in module ft is normal with mean 
and standard deviation depending on the expression 
values of the parents of the module through a re- 
gression tree that is called the regulation program of 
the module. The tests on the internal nodes of the 
regression tree are of the form x ^ s for some split 
value s, where x is the expression value of the parent 
associated to the node (Figure 6). 

The Bayesian score is obtained by taking the 
log of the marginal probability of the data likeli- 
hood over the parameters of the normal distribu- 
tions at the leaves of the regression trees with a 
normal-gamma prior (see [30] and Additional file 1 
for more details; the actual expression for the score is 
in eq. (S5)). Its main property is that it decomposes 
as a sum of leaf scores of the different modules: 

S = E S * = EE^)< ( 3 ) 

k k e 

where £t denotes the experiments that end up at 
leaf I after traversing the regression tree. A normal- 
gamma prior ensures that Sk(£e) can be solved ex- 
plicitly as a function of the sufficient statistics (num- 
ber of data points, mean and standard deviation) of 
the leaves of the regression tree (see Additional file 
1). 



Learning module regulation programs 

For a given assignment of genes to modules, find- 
ing a maximum for the Bayesian score (3) consists 
of finding the optimal partitioning of experiments 
into 'leaves' t for each module separately, i.e., find 
a collection of subsets £i C {1,...,M} such that 
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U e £ e = {1, • • • , M}, £ e n£ e , = <b for I ^ and 

S* = £S*(ft) (4) 

is maximal. In particular we do not have to define 
the parent sets life of the modules in order to find 
an optimal partition. 

We use a bottom-up hierarchical clustering 
method to heuristically find a high-scoring partition. 
At each step of the process we have a collection of 
binary trees T a which represent subsets £ a of ex- 
periments. The binary split of T a into its children 
T ai and T a2 corresponds to a partition of the set £ a 
into two sets: £ a = £ ai U £ a2 . The initial collection 
consists of trivial trees without children representing 
single experiments. To proceed from one collection 
of trees to the next, the pair of trees with highest 
merge score is merged into a new tree, and the col- 
lection of binary trees decreases by one, eventually 
leading to one hierarchical tree To representing the 
complete experiment set £q = {1, . . . , M}. The sim- 
plest merge score is given by the possible gain in 
Bayesian score by merging two experiment sets: 

r au a 2 = Sk(£ ai U £ a2 ) - S k (£ ai ) - S k (£ a2 ). (5) 

In Additional file 1 we define an alternative merge 
score related to the Bayesian hierarchical clustering 
method of [31]. This merge score takes into account 
the substructure of the trees below T ai and T a2 in 
addition to the Bayesian score difference (5), and 
tends to produce more balanced trees. 

In the final step, we need to cut the hierarchical 
tree To. To this end we traverse the tree from the 
root towards its leaves. If we are at a subtree node 
T a with children T ai and T a2 , we compute the score 
difference (5). If this difference is negative, the to- 
tal score is improved by keeping the split T a , and 
we move on to test each of its children nodes. If 
the difference is positive, the total score is improved 
by not making the split T a , and we remove its chil- 
dren nodes from the tree. The experiment set £ a 
becomes one of the leaves of the regulation program, 
contributing one term in the sum (4). 

The pseudocode for the regulation program 
learning algorithm is given in Figure S3 in Addi- 
tional file 1. 

In [6,30], regulation programs are learned top- 
down by considering all possible splits on all cur- 
rent leaves with all potential regulators, so regu- 
lation trees and regulator assignments are learned 



simultaneously. As a result missing regulators or 
noise in the regulator data might lead to a subop- 
timal partitioning of the experiments in a module. 
In our approach we have focused on finding an op- 
timal partition of the module regardless of the set 
of potential regulators. A module collects the data 
of many genes and therefore this partition will be 
less affected by noise or missing data than when it 
is determined by exact splits on single regulators. 

Regulator assignment 

At a given internal node T a of the regulation tree 
T , the experiment set £ a is partitioned into two 
distinct sets £ ai and £ a2 according to the tree struc- 
ture. Given a regulator r and split value s, we can 
also partition £ a into two sets 

IZi = {to G £ a : x r;m < s} 

TZ 2 = {TO E £ a ■ 2V,m > S}, 

where x r>m is the expression value of regulator r in 
experiment m. 

Consider now two random variables: E which 
can take the values a.\ or a 2 , and R which can take 
the values 1 or 2, with probabilities defined by sim- 
ple counting, p(E = a\) = \£ ai \/\£ a \, p(R = 1) = 
|7£i|/|£ Q |, etc. We are interested in the uncertainty 
in E given knowledge (through the data) of i?, i.e., 
in the conditional entropy [32] 

H(E | R)=pih( qi )+p 2 h(q 2 ), (6) 

where Pi = p(R = i), h is the binary entropy func- 
tion 

%) = -i i°g(<?) - (i - q) lo g(! - <?), 

and qi are the conditional probabilities 

qi =p(E = ai \R = i)= l£ai ^ ,i = l,2. 

In the presence of missing data, the probabilities pi 
and qi need to be modified to take into account this 
extra uncertainty, details are given in Additional file 
1. 

The conditional entropy is nonncgativc and 
reaches its minimum value when qi = or 1 (and 
consequently q 2 = 1, resp. 0), which means the £ 
and TZ partitions are equal and the regulator - split 
value pair 'explains' the split in the regulation tree 
exactly. Hence we assign to each internal node of a 
regulation tree the regulator - split value pair which 
minimizes the conditional entropy (6). Since this 
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assignment has to be done only once, after the mod- 
ule networks score has converged, the best regulator 
- split value pairs can be found by simply enumer- 
ating over all possibilities, even for relatively large 
data sets. The actual algorithm for assigning regula- 
tors to all nodes operates first on nodes closer to the 
roots of the trees where the most significant splits 
are located, and takes into account acyclicity con- 
straints on the module network. It is presented in 
pseudocode in Figure S4 in Additional file 1. 



Learning module networks 

To find an optimal module network, learning of reg- 
ulation trees is alternated with reassigning genes 
to other modules until convergence of the Bayesian 
score. Module initialization can be done using any 
clustering algorithm. Here, we used fc-means [33], 
and reassigning is done like in [30] by making all 
single-gene moves from one module to another which 
improve the total score. 



Network comparison 

To obtain a gene network from a module network, 
we put directed edges from the regulators of a mod- 
ule to all the genes in that module. We compare 
inferred to true network by computing the number 
of edges that are true positive (tp), false positive (fp) 
and false negative (fn). Standard measures for the 
inference quality are precision and recall. Precision 
(denoted P) is defined as the fraction of edges in 
the inferred module network that is correct, and re- 
call (denoted R) as the fraction of edges in the true 
network that is correctly inferred, i.e., 

P = -^- a=-* 

tp + fp tp + fn 

The -F-measure, defined as the harmonic mean of 
precision and recall, F = pq^, can be used as a 
single measure for inference quality. 

The module content for different module net- 
works can be compared by computing for each mod- 
ule in one network how many genes of it are also 
grouped together in one module in the other net- 
work, and averaging over the number of modules. 
We call this the average module overlap. 



GO overrepresentation analysis 

GO enrichment P-values for all modules were de- 
termined using the BiNGO tool [34], which was in- 
corporated into the LeMoNe package. The overrep- 
resentation of GO Biological Process categories was 
tested using hypergcomctric tests and the resulting 
P- values were corrected for multiple testing using a 
False Discovery Rate correction. 



Software 

The latest version of SynTReN can be downloaded 
from [35] and the latest version of Genomica from 
[27]. LeMoNe is implemented in Java and available 
for download in source or executable form [36]. 



Results and Discussion 

Implementation differences in LeMoNe versus Ge- 
nomica 

As a starting point for the development of LeMoNe, 
we re-implemented the methodology described by 
Segal et al. [6] , which is incorporated in the Genom- 
ica software package. Briefly, Genomica takes as in- 
put a gene expression data set and a list of potential 
regulators. After an initial clustering step, the algo- 
rithm iteratively constructs a regulatory program for 
each of the modules (clusters) in the form of a regres- 
sion tree, and then reassigns each gene to the module 
whose program best predicts the gene's expression 
behavior. These two steps are repeated until con- 
vergence is reached. In this process, the algorithm 
attempts to maximize a Bayesian score function that 
evaluates the model's fit to the data [6]. 

We used the same overall strategy and the same 
Bayesian score function in LeMoNe. However, with 
respect to the original methods described by Segal et 
al. [6], LeMoNe incorporates an alternative strategy 
for inferring regulatory programs that offers some 
advantages (see Methods). First, LeMoNe uses a 
Bayesian hierarchical clustering strategy to learn the 
regulation trees for the modules from the bottom up 
instead of from the top down. Furthermore, con- 
trary to Genomica [6] , the partitioning of expression 
data inside a module is not dependent on the expres- 
sion profiles of the potential regulators, but only on 
the module data itself. This should allow the pro- 
gram to better handle missing or 'hidden' regulators 
(see further). As an additional advantage, the as- 
signment of regulators to regulation program nodes 
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can be postponed until after the final convergence of 
the Bayesian score, which leads to considerable time 
savings (see further). 

A second modification in LeMoNe is that regula- 
tors are assigned to the splits in the regulation tree 
(data splits) based on an information theoretic mea- 
sure, namely the conditional entropy of the partition 
of the regulator's expression profile dictated by the 
data split, given the partition imposed by a partic- 
ular split value (see Methods). As a consequence, 
a data split does not impose, but merely prefers, a 
clean partition of the best-matching regulator's ex- 
pression values around a certain split value. In com- 
parison with Genomica, where only such clean parti- 
tions are used, this strategy has the advantage that 
potential noise in the regulator's expression is taken 
into account. Additionally, the conditional entropy 
can be used to estimate the quality of the regula- 
tor assignment, and thus suggest missing potential 
regulators for splits without a low-entropy regulator. 
Information theory has been used before to analyze 
and cluster gene expression data [13,22-26]. Our 
method introduces elements of information theory 
into the module networks formalism. 

In the following sections, we use SynTReN data 
to test LeMoNe in a completely controlled situation 
in which simulated microarray data is analyzed for 
a known underlying regulatory network of reason- 
able size, and we assess the performance effects of 
the aforementioned methodological changes with re- 
spect to Genomica [6]. The LeMoNe package and 
the source code are freely available under the GPL 
license (see Software section). 

Modularity 

A fundamental assumption of the module networks 
formalism is that real biological networks have a 
modular structure [18] that is reflected in the gene 
expression data, and therefore groups of genes can 
share the same parameters in the mathematical de- 
scription of the network. In LeMoNe, as in other 
module network learning programs [6, 11], the de- 
sired number of modules has to be given as an in- 
put parameter to the inference program, and a main 
question is how the optimal module number has to 
be determined. Fewer modules means lower compu- 
tational cost and more data points per module. This 
results in a better estimation of parameters, but pos- 
sibly entails oversimplifying the network and miss- 
ing important regulatory relations. More modules 



means more specific optimization of the network at 
higher computational cost. When modules become 
too small, there could be too few data points per 
module for a reliable estimation of the parameters. 
In this section we use the Bayesian score to estimate 
the optimal number of modules. 
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Figure 1: Bayesian score as a function of the num- 
ber of modules for data sets with 10, 100 and 300 
experiments (top to bottom). The score is normal- 
ized by the number of genes times the number of 
experiments. The curves are least squares fits of the 
data to a linear non-polynomial model of the form 
ao+Y^k=i dkX k ~ 1 e~ x / b0 ° with x the number of mod- 
ules and n = 6. 

Throughout this manuscript, we make use of a 
SynTReN-generated synthetic network encompass- 
ing 1000 genes of which 105 act as regulators (see 
Methods). Unless otherwise stated, we use all 105 
regulators in this network as potential regulators 
while inferring module networks. Figure 1 shows the 
Bayesian score, normalized by the number of genes 
times the number of experiments, for this network 
and different numbers of experiments. In all three 
panels, the score reaches a maximum. The top panel 
(data set with 10 experiments), which has a true 
maximum for the score, illustrates that the network 
inference problem is underdetermined for very small 
data sets. Increasing the number of modules be- 
yond the location of the maximum lowers the fit of 
the model to the data. For larger data sets (middle 
and bottom panel, 100, resp. 300 experiments), the 
score saturates and after a certain point the model 
does not improve anymore by increasing the number 
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of modules. Hence, the optimal number of mod- 
ules should be situated around the point where the 
Bayesian score starts to level off. For increasing 
number of experiments, the optimal number shifts 
to the right. This suggests that increasing amounts 
of data enable the algorithm to uncover smaller and 
more finetuned modules. However, the rightbound 
shift of the optimum becomes less pronounced for 
increasing number of experiments. This reflects the 
fact that only a limited number of modules are in- 
herently present in the true network. 

We define the number of modules in the true net- 
work as the number of gene sets having the same set 
of regulators (taking into account activator or re- 
pressor type). This number is 286 for the 1000 gene 
synthetic network we consider here, among which 
there are 180 with at least 3 genes and 126 with at 
least 5 genes. The saturation behavior of the score 
curves for 100 and 300 experiments in Figure 1 more 
or less reflects the modularity in the true network. 



Network inference performance 

A more detailed analysis of network inference perfor- 
mance is obtained by comparing the set of regulator 
to gene edges in the true (synthetic) network and in 
the inferred module network. We use standard mea- 
sures such as recall, precision, and F-measure (see 
Methods). 

Figure 2 shows the recall as a function of the 
number of modules for different numbers of experi- 
ments. The location of the recall maxima seems to 
agree well with the saturation points of the corre- 
sponding Bayesian score curves (Figure 1). As ex- 
pected the maximal recall, and hence the total num- 
ber of true positives, increases for data sets with 
more experiments, saturating between 30 and 35% 
for data sets with > 100 experiments. 

A similar saturation with increasing number of 
experiments is seen for the precision curves (Figure 
SI in Additional file 1) and the F-measure curves 
(Figure S2 in Additional file 1). Whereas the preci- 
sion continues to increase with the number of mod- 
ules, the F-measure saturates, but does so at a 
higher number of modules than the Bayesian score. 
Taking into account the modular composition of the 
true network (see previous section), the Bayesian 
score and the recall curves seem to generate better 
estimates of the optimal number of modules than the 
F-measure curves. 
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Figure 2: Recall as a function of the number of 
modules for data sets with 10 (magenta), 50 (cyan), 
100 (red), 200 (green), and 300 (blue) experiments. 
The curves are least squares fits of the data to 
a linear non-polynomial model of the form a + 
12k=i ikX k ^ 1 e~ x ^ 500 with x the number of modules 
and n = 3. 

We also investigated whether the inferred regula- 
tion programs provide any information regarding the 
quality of the regulators. When analyzing real data, 
such information could be useful to prioritize regu- 
lators for experimental validation. A first property 
which we tried to relate to a regulator's quality is its 
hierarchical location in the regulation program. It 
seems that regulators deeper in the regulation tree 
become progressively less relevant. Figure 3 illus- 
trates this effect by showing separately the preci- 
sions for the roots of the regulation trees (level 0), 
the children of the roots (level 1), and the grandchil- 
dren (level 2) for data sets with 100, 200, and 300 ex- 
periments. The precisions for the various regulatory 
levels remain within each others standard deviation 
across the tested range of experiments, but the pre- 
cision clearly diminishes with increasing levels in the 
regulation program. For each data set and inferred 
module network we created an additional network 
where each module is assigned a random regulator 
set of the same size as in the inferred network. The 
precision for these random regulation programs is 
shown in the bottom most curves in Figure 3. For 
regulation levels beyond level 2, the precisions fall in 
this region of random assignments and they add al- 
most exclusively false positives (results not shown). 
In general, we can say that the top regulators are far 
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more likely to represent true regulatory interactions. 
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Figure 3: Precision as a function of the number 
of modules for subnetworks generated by regulation 
tree levels (roots), 1 and 2, and for random assign- 
ments of regulators to regulation tree nodes (top to 
bottom) for data sets with 100 (red), 200 (green) 
and 300 (blue) experiments. The curves are least 
squares fits of the data to a linear non-polynomial 



model of the form ao + J2k=i a ^ x 
the number of modules and n = 3 
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An additional layer of information is provided 
by the regulator assignment entropies. A low value 
of the entropy corresponds to a regulator matching 
well with a split in the expression pattern of the reg- 
ulated module. Hence we expect regulators with low 
entropy to have a higher probability to be true reg- 
ulators. This is illustrated in Figure 4. For the data 
set with 100 experiments and 150 modules, the sub- 
network generated by all regulators with an entropy 
lower than, e.g., 0.1 has precision 0.334, almost twice 
as high as the precision of 0.176 for the whole mod- 
ule network. For the subnetwork generated by the 
regulators at the roots of the regulation trees, the 
precision increases from 0.42 to 0.53 by introducing 
the same entropy cut-off. Other data sets show sim- 
ilar behavior (data not shown). 




Regulator assignment entropy 



Figure 4: Cumulative distribution of precision as a 
function of regulator entropy for the data set with 
100 experiments and 150 modules: each point at an 
entropy value x (spaced at 0.01 intervals) gives the 
precision of all (blue) or top (red) regulators with 
assignment entropy < x. 



Performance of LeMoNe versus Genomica 

Next, we compared the performance of LeMoNe and 
Genomica [6,27]. Both programs heuristically search 
for an optimal module network and are therefore 
bound to end up at a (different) local maximum of 
the Bayesian score. We simulated 10 different data 
sets with 100 experiments for the same 1000 gene 
network as before and inferred a network with 150 
modules (corresponding to the point where the score 
function in Figure 1 starts to saturate). The average 
precisions are 0.196 ±0.015, resp. 0.155 ±0.013, and 
average recalls 0.255 ± 0.016, resp. 0.381 ± 0.021, for 
LeMoNe, resp. Genomica. The average F-measure 
is 0.222 ± 0.015, resp. 0.220 ± 0.016. The similar- 
ity in performance at the level of the whole module 
network, with a bias for higher precision in LeMoNe 
and higher recall in Genomica, is further seen in Fig- 
ure 5, where we plot recall - precision pairs for both 
programs at different noise levels. For each of the 
plotted series, lower noise levels correspond to points 
in the upper right of the scries plot, and higher noise 
levels to points in the bottom left, illustrating a gen- 
eral decrease in performance for more noisy data. 

The average module overlap between the mod- 
ule networks generated by LeMoNe and Genomica is 
0.46±0.02. Both programs, although featuring simi- 
lar performance, attain a different local maximum of 



S 



the Bayesian score, and the differences in the corre- 
sponding module networks can be quite substantial. 

In general we can say that both module network 
inference programs suffer from a high number of false 
positive edges. When using LeMoNe, false positives 
can to some extent be filtered out by looking only at 
the highest levels in the regulation tree (Figure 3). 
To see whether this is also the case for Genomica, 
we calculated the recall and precision for the sub- 
networks generated by the top regulators alone (Fig- 
ure 5). The recall for these subnetworks is generally 
lower as they contain far fewer edges than the com- 
plete module network. For LeMoNe this decrease in 
recall is compensated by a large increase in preci- 
sion. For Genomica the decrease in recall is bigger, 
with only a slight increase in precision. There is no 
analogue of the assignment entropy in Genomica, so 
we cannot compare the gain in precision by imposing 
an entropy cut-off. 

One of the major differences in LeMoNe with re- 
spect to Genomica is the fact that the regulatory 
tree structures learned by LeMoNe are only depen- 
dent on the expression data inside the module, and 
not on the expression profiles of potential regulators. 
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Figure 5: Comparison of heuristic search methods 
by recall - precision pairs for data sets with 100 ex- 
periments and different noise levels, for the complete 
module network, and for the subnetwork generated 
by the top regulators in the regulation programs. 

We hypothesized that this might allow LeMoNe 
to better handle missing or hidden regulators, a sit- 
uation which might for instance occur if the true 
regulator is missing from the list of potential reg- 



ulators. In order to test this hypothesis, we sim- 
ulated 10 different data sets with 100 experiments 
for the same 1000 gene network and inferred module 
networks with 150 modules using both LeMoNe and 
Genomica. In each of the ten runs we randomly left 
out 20% of the potential regulators from the regu- 
lator list (i.e., we used 84 instead of 105 potential 
regulators). The average _F-measure of the result- 
ing networks is 0.183 ± 0.025 for LeMoNe, versus 
0.126±0.012 for Genomica. Compared to the results 
when taking into account all 105 potential regula- 
tors (see above), the performance drop for LeMoNe 
is clearly less pronounced (17.6%) than for Genom- 
ica (42.7%), indicating that LeMoNe is indeed better 
at handling missing regulators. 

Regarding the speed of LeMoNe versus Genom- 
ica, we can say that LeMoNe is considerably faster 
for larger data sets. This is mainly due to the fact 
that in LeMoNe the regulators need only be assigned 
to the regulation programs once, after the final con- 
vergence of the Bayesian score. This saves a con- 
siderable amount of time on scanning possible split 
values and performing acyclicity checks at each iter- 
ation. Roughly, LeMoNe and Genomica performed 
equally in terms of speed on the SynTReN data set 
containing 1000 genes and 100 experiments. On a 
real data set with 173 experiments [21], LeMoNe was 
about twice as fast as Genomica when limiting the 
number of genes to 1000, and ten times faster when 
considering the whole data set (2355 genes). 



Biological data 

For real biological data sets the underlying regula- 
tory network is generally not known (indeed, the 
primary purpose of module network learning algo- 
rithms is precisely to infer the regulatory network) 
and hence it is difficult to assess the quality of an in- 
ferred network. This is one of the main reasons why 
microarray data simulators such as SynTReN have 
to be used to validate the methodology. However, 
given the fact that data simulators seldomly capture 
all aspects of real biological systems, any results ob- 
tained on simulated data should be approached crit- 
ically and, where possible, validated on biological 
data sets. Here, we investigate to what extent mod- 
ule networks inferred from real expression data can 
be validated using structured biological information. 

For S. cerevisiae, there is partial information 
on the underlying network structure in the form of 
ChlP-chip data and promoter motif data [9], and 
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more profusely in the form of GO annotations [2]. 
We learned module networks for budding yeast from 
an expression compendium containing data for 2355 
genes under 173 different stress conditions [21] (the 
Gasch data set) using the same number of modules 
(50) and the same list of potential regulators as Se- 
gal et al. [6]. We then calculated the F- measure 
between the resulting regulatory network and the 
ChlP-chip network of Harbison et al. [9], consider- 
ing in the former network only regulators that were 
tested by ChlP-chip. In general, the resulting recall 
and precision values are substantially lower than for 
simulated data of the same size, namely 0.0195, resp 
0.0218. When looking at individual modules, only 
13 out of 50 regulatory programs feature at least 
one regulator that is to some extent confirmed by 
ChlP-chip data. In addition, we tried to relate the 
regulatory program of a module to the module's gene 
content in functional terms using GO annotation. 
Overall, only 8 out of 50 programs possess one or 
more regulators belonging to a yeast GOSlim Bio- 
logical Process category that is overrepresented in 
the module (considering only the leaf categories in 
the GOSlim hierarchy). Remarkably, only 3 of these 
8 programs overlap with the 13 regulatory programs 
featuring overlap with the ChlP-chip data. This ob- 
servation suggests that both data types can actually 
be used only to a limited extent to infer the quality 
of regulation programs. Indeed, many factors limit 
the use of ChlP-chip and GO data as 'gold stan- 
dards'. Both types of data are noisy and offer incom- 
plete information. For example, Harbison et al. [9] 
mainly profiled transcription factor binding in rich 
medium conditions, whereas the Gasch data set con- 
tains primarily stress conditions. The parts of the 
transcriptional network that are active under these 
conditions may substantially differ [9,10]. Moreover, 
the expression profile of a transcription factor is of- 
ten not directly related to the expression profile of 
its targets, for example due to post-translational reg- 
ulation of transcription factor activity. As a conse- 
quence, indirect regulators such as upstream signal 
transducers may feature in the regulation programs 
instead of the direct regulators, i.e., the transcrip- 
tion factors [6]. As for GO, many regulators appear 
not to be annotated to the GO Biological Process 
categories of their target genes. Taking these factors 
into account, the limited overlap with the available 
ChlP-chip and GO data does not necessarily reflect 
the quality of the inferred regulatory programs. 
On the contrary, we established that the regu- 



latory programs do in fact contain a considerable 
amount of relevant and potentially valuable informa- 
tion. Indeed, by manually investigating individual 
modules in more detail, we could in many cases qual- 
itatively relate the regulators to the module's gene 
content. For example, the module shown in Figure 6 
is enriched in a.o. genes involved in the main path- 
ways of carbohydrate metabolism (P = 1.0596E— 4), 
energy derivation by oxidation of organic compounds 
(P = 1.2046E-4) and alcohol biosynthesis (P = 
1.3185E— 2). None of the 5 regulators of this module 
could be related to the module's gene content based 
on ChlP-chip or GO information. However, based 
on their description in the Saccharomyces Genome 
Database (SGD) [28], all 5 regulators could be linked 
to glucose sensing or the response to (glucose) star- 
vation, processes that can arguably influence the ex- 
pression of carbohydrate metabolism genes. 




Figure 6: Sample module learned from the Gasch 
data set [21]. Red and green hues indicate upreg- 
ulation resp. downregulation. The pairs (x, y) un- 
der each split in the regulation tree represent the 
Bayesian score gain over the split, normalized on the 
number of genes in the complete network (ar), and 
the regulator assignment entropy (y). 

However, one must keep in mind that it remains 
impossible to infer complete and accurate regulatory 
networks from gene expression data alone. Expres- 
sion data only provides information on one regu- 
latory level, namely the transcriptional level. In- 
formation on (post-)translational regulation is lack- 
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ing. The current expression-based module net- 
work algorithms (e.g. [6], this study) try to rem- 
edy this problem by including signal transducers in 
the list of potential regulators in addition to tran- 
scription factors, in the hope to capture some of 
this non-transcriptional regulation from the expres- 
sion profiles of key signal transducers. However, this 
trick can only be expected to uncover a fraction 
of such non-transcriptional regulatory interactions, 
and moreover the direct targets of these regulatory 
interactions are not identified. A potential remedy 
for this shortcoming would be to include other types 
of data, such as data on protein expression levels 
and protein phosphorylation, in the module network 
learning framework. Unfortunately, such data are 
not yet available on a large scale. 

In summary, our results indicate that structured 
biological information such as ChlP-chip data or GO 
can not (yet) be used to measure the performance of 
module network algorithms in an automated way. 
This is a strong argument for using data simulators 
such as SynTReN for the purpose of developing, test- 
ing and improving such algorithms. 



Conclusions 

We developed a module network learning algorithm 
called LeMoNe and tested its performance on sim- 
ulated expression data sets generated by SynTReN 
[20]. We found that the Bayesian score can be used 
to infer the optimal number of modules, and that 
the inference performance increases as a function of 
the number of simulated experiments but saturates 
well below 1. 

We also used SynTReN data to assess the effects 
of the methodological changes we made in LeMoNe 
with respect to the original methods used in Ge- 
nomica [6]. Overall, application of Genomica and 
LeMoNe to various simulated data sets gave compa- 
rable results, with a bias towards higher recall for 
Genomica and higher precision for LeMoNe. How- 
ever, LeMoNe offers some advantages over the origi- 
nal framework of Segal et al. [6] , one of them being 
that the learning process is considerably faster. An- 
other advantage of LeMoNe is the fact that the algo- 
rithm 'lets the data decide' when learning the reg- 
ulatory tree structure. The partitioning of expres- 
sion data inside a module is not dependent on the 
expression profiles of the potential regulators, but 
only on the module data itself. As a consequence, 



the assignment of 'bad' regulators (in terms of as- 
signment entropy) to 'good' module splits (in terms 
of Bayesian score) might suggest missing or hidden 
regulators. This situation might occur if the true 
regulator is missing from the list of potential regu- 
lators, or if the expression of the targets cannot be 
related directly to the expression of the regulator, 
e.g., due to posttranslational regulation of the reg- 
ulator's activity. We have also shown that filtering 
the module network by the location of regulators in 
the regulation program or by introducing an entropy 
cut-off improves the inference performance. When 
inferring regulatory programs from real data, these 
criteria may prove useful to prioritize regulators for 
experimental validation. 

Finally, we explored the extent to which mod- 
ule networks inferred from real expression data could 
be validated using structured biological information. 
For that purpose, we learned module networks from 
a microarray compendium of stress experiments on 
budding yeast [21]. We found that the resulting reg- 
ulatory programs overlapped only marginally with 
the available ChlP-chip data and GO information. 
However, more detailed manual analysis uncovered 
that the learned regulation programs are neverthe- 
less biologically relevant, suggesting that an auto- 
mated assessment of the performance of module net- 
work algorithms using structured biological informa- 
tion such as ChlP-chip data or GO is ineffective. 
This underscores the importance of using data sim- 
ulators such as SynTReN for the purpose of testing 
and improving module network learning algorithms. 
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Precision and F-measure as a function of 
the number of modules and experiments 
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Figure SI: Precision as a function of the number 
of modules for data sets with 10 (magenta), 50 
(cyan), 100 (red), 200 (green), and 300 (blue) ex- 
periments. The curves are least squares fits of the 
data to a linear non-polynomial model of the form 
ao+J2k=i a,kX k ~ 1 e~~ x / 500 with x the number of mod- 
ules and n = 3. 

Module networks Bayesian score 

We use the same Bayesian score as in the original 
module networks formalism [6,30]. The data like- 
lihood is given by evaluating the module network 
joint probability distribution (eq. (2)) on the data 
set, assuming independent experiments, 



M K 

£= n n n Pk(x hm i {x^ m -. j & n fc }), 

m— 1 k— 1 i^Ak 

where X^ m is the log-normalized expression value of 
gene i in experiment m. 

The Bayesian score is obtained by taking the log 
of the marginal probability of the data likelihood 
over the parameters of the normal distributions at 
the leaves of the regression trees with a normal- 
gamma prior. It decomposes as a sum of leaf scores 
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Figure S2: F- measure as a function of the num- 
ber of modules for data sets with 10 (magenta), 50 
(cyan), 100 (red), 200 (green), and 300 (blue) ex- 
periments. The curves are least squares fits of the 
data to a linear non-polynomial model of the form 
ao + Y^k=i cikX k ~ 1 e~ x ^ 500 with x the number of mod- 
ules and n = 3. 

of the different modules: 



k k i 

SfcO^f) = log JJd^idTp(fi,T) Y[ Y\. P^r(xi, m ), 

(SI) 



where k runs over the set of modules and £ runs 
over the set of leaves of the regression tree of mod- 
ule k] Si denotes the experiments that end up at 
leaf I after traversing the regression tree and Ak 
denotes the genes assigned to module fc; p(/x,r) 
is a normal-gamma distribution over the mean fi 
and precision r of the normal distribution p p , T , i.e., 
p{fi,T) = p((j, | t)p(t) where p(r) ~ r(a Q ,f3 ) and 
f^M-A^o^or)- 1 ): 
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PO _ao-l--/? T 



r(a ) 



(S2) 

(S3) 
(S4) 



with a , A), Ao > and — oo < fi < °°- 

Insertion of eqs. (S2)-(S4) into eq. (SI) leads to 
an integral that can be solved explicitly as a function 
of the sufficient statistics 

nP= E E4- ? = o,i,2. 

me£e i^A k 

of the leaves of the regression tree. The result is 



S fe (^) = -±< } log(27r) + ilog( 



i d(1> 



-logr( ao ) + logr(ao + i< 
+ a log A, - (a + f i^) k>g/3i (S5) 



where 



ft = A) + g 



Wx2 



H 2 - 



r: 



+ 



2(A + < ) )< ) 



Learning module regulation programs 

The pseudocode for the regulation program learning 
algorithm is given in Figure S3. In its simplest form, 
the merge score for two trees T ai and T Q2 considers 
only the gain in Bayesian score that is obtained by 
merging two sets into one: 

r ai ,a 2 — Sk (£ ai U £a 2 ) - Sk (£ ai ) - Sk (£a 2 ) ■ (S6) 

In our computations we used a merge score which is 
slightly more complicated and takes into account the 
whole substructure of the tree below T ai and T a2 . 

Let T a be a tree with children T ai and T a2 , and 
define recursively 

Z a = e Sk{£a) 
with initial condition 



+ Z ai Z a 



oSk({m}) 



for the trivial tree with one experiment m and no 
children. The new merge score is then defined as 



= S k (£ ai U £ a2 ) - lnZ Ql - \nZ a 



(S7) 



A binary tree T a generates a nested set of parti- 
tions V a (we write this as V a ~ T a ) of its experiment 
set £ a and to each such partition corresponds a score 



Sfc(P Q ) = X;Sfc(£i) 



where £i are the subsets of £ a forming the partition 
V a . Since a partition generated by T a is either the 
singleton partition V a — {£ a }, or a combination of 
a partition generated by T ai with a partition gener- 
ated by T a2 , we get immediately 



Z n — 



J2 e s ^\ 



or 



lnZ a = S k (£ a )+ln(l+ ]T e W-S k (S a ^ 

V a ~T a 

(S8) 

We conclude that the merge score (S7) contains the 
score difference (S6) as well as other terms defined 
by the structure of the subtrees T ai and T Q2 . If two 
pairs of trees give the same score difference (S6), the 
merge score (S7) will typically favor to merge the 
pair with the smallest substructure first (as the num- 
ber of terms in the summation in (S8) is smaller). 
Hence, using (S7) instead of (S6) leads to more bal- 
anced trees. 

Since we are building the tree from the bottom 
up, computing the partition sums Z a is done along 
the way, and using the merge score (S7) instead of 
(S6) comes at no computational cost. The whole 
process of constructing the tree with a merge score 
depending on all the partitions generated by the sub- 
trees is very similar to the Bayesian hierarchical clus- 
tering method of [31]. 



Regulator assignment in the presence of 
missing values 

In real data there are often missing values, so for 
some experiments we do not know if a regulator is 
above or below a given split value. Using the non- 
missing values to define the sets 1Z\ and 7\L 2 , we 
compute qi as before. Since regulators with a lot 
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of missing values lead to more uncertainty, we pe- 
nalize those by moving the conditional probabilities 
Qi closer to the maximum uncertainty value of \ by 
defining 

^ = ( 1 -H)(ft-2) + 2- 

where 

7^3 = {m e £ a : x r ^ m is missing}. 

Note that when there are no missing values, q[ = qi , 
and in the extreme case where there are only missing 
values, q\ = 2 . For the probability distribution of R, 
we distribute the missing values proportionally over 

/* Find hierarchical tree */ 
Input: A list treeList of trivial trees 

representing single experiments, 
while treeList has more than 1 element do 

compute r ai . a2 for each pair of trees in 

treeList; 

construct the joined tree T a = T ai U T a2 
for the pair with highest r aiiQ2 ; 
_ add T a to treeList and remove T ai , T a2 ; 
Output: A single tree T representing all 
experiments. 

/* Find optimal regulation program 

leaves */ 
testScore (T .root) ; 

/* Recursive procedure to cut the 

hierarchical tree */ 
Begin testScore(node) 

if Sk(node) < Sk (node, left Child) + 
Sk (node, right Child) then 
testScore(node.leftChild); 
testScore(node.rightChild); 
else 

|_ cut tree below node; 

End 



1 and 2, 

Vl ~ |f a I 

such that p'i +p' 2 still sums up to 1 . We now minimize 
the conditional entropy 

H{E\R)=p' l h{q[)+p' 2 h{q l 2 ) 

corresponding to these modified probability distri- 
butions. For the sufficient statistics of the leaves of 
the module, we simply ignore the missing values as 
there are typically more than enough combined data 
points for a reliable computation of those statistics. 

The complete regulator assignment algorithm is 
given in Supplementary Figure S4 

/* Assign regulators separately for 
different regulation tree levels. 
*/ 

/* Level is the distance from a node 
to the root. */ 

for each level I do 

create a list nodeList with all nodes at 
level I in the trees of all modules; 
while nodeList is not empty do 
for each node in nodeList do 

for each regulator r in the set of 
potential regulators that do not 
break acyclicity do 

compute the entropy for 
|_ assigning r to node; 

find the node-regulator pair 
(bestNode, bestR) with least 
entropy; 

assign bestR to bestNode; 
|_ remove bestNode from nodeList; 

Figure S4: Pseudocode for the regulation program 
learning method 



Figure S3: Pseudocode for the regulation program 
learning method 
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